ABSTRACT 



The operating characteristics of 114 mathematics 
pretest items from the Praxis I: Computer Based Test were analyzed in 
terms of item attributes and test developers' judgments of item 
difficulty. Item operating characteristics were defined as the 
difficulty, discrimination, and asymptote parameters of a three 
parameter logistic item response theory (IRT) model. Three types of 
item attributes were considered: (1) surface features (for example, 
whether or not the item stem contained an equation); (2) aspects of 
the solution process (for example, whether or not the solution 
required application of a standard formula); and (3) response type 
(free-response or multiple-choice). Because the attribute set 
included large numbers of categorical variables, an approach based on 
binary regression trees (Breiman, Friedman, Olshen, and Stone, 1984) 
was implemented. The results were quite impressive for asymptote 
parameters (85% of variance explained), somewhat less so for 
difficulty parameters (36% of variance explained) and fairly 
unimpressive for discrimination parameters (only 12% of variance 
explained). In addition, the tree-based approach was found to be 
particularly useful for identifying important interaction effects and 
for developing graphical summaries of the modeling results. Six 
tables and eight figures support the analyses. (Contains 11 
references.) (Author/SLD) 
A Tree-Based Analysis of Items From 
An Assessment of Basic Mathematics Skills 



The operating characteristics of 114 Mathematics pretest items from the 
Praxis I: Computer Based Test were analyzed in terms of item attributes and 
test developers' judgements of item difficulty. Item operating characteristics 
were defined as the difficulty, discrimination and asymptote parameters of a 
three parameter logistic IRT model. Three types of item attributes were 
considered: surface features (for example, whether or not the item stem 
contained an equation); aspects of the solution process (for example, whether 
or not the solution required application of a standard formula); and response 
type (free-response or multiple-choice). Because the attribute set included large 
numbers of categorical variables, an approach based on binary regression trees 
(Breiman, Friedman, Olshen, and Stone, 1984) was implemented. The results 
were quite impressive for asymptote parameters (85% of variance explained), 
somewhat less so for difficulty parameters (36% of variance explained) and 
fairly unimpressive for discrimination parameters (only 12% of variance 
explained). In addition, the tree-based approach was found to be particularly 
useful for identifying important interaction effects and for developing graphical 
summaries of the modeling results. 
A Tree-Based Analysis of Items From 
An Assessment of Basic Mathematics Skills 



The goal of this study was to determine the degree to which the operating 
characteristics of basic mathematics achievement test items could be predicted from an 
analysis of item attributes and test developers' judgements of item difficulty. Items' 
operating characteristics were defined as the difficulty, discrimination and asymptote 
parameters of the three parameter logistic (3PL) IRT model. Three types of item attributes 
were considered: surface features of the items (for example, whether or not the item stem 
included an equation); aspects of the solution process (for example, whether or not the 
solution required application of a standard formula); and response format (free-response or 
multiple-choice). Studies of this type may be conducted for a variety of reasons including: 
(1) reducing sample size requirements for item calibration (Mislevy, Sheehan & Wingersky, 
1993); (2) providing for more systematic test design and construct validation (Embretson & 
Wetzel, 1987; and Bejar, 1991); and (3) diagnosing students' misconceptions (Tatsuoka, 1987, 
1990). 

The analyses reported in this paper were conducted using a combination of least- 
squares regression analysis and binary regression trees (Breiman, Friedman, Olshen, and 
Stone, 1984). Regression analysis has been used in numerous studies of the components of 
item difficulty (see, for example, Enright, Allen & Kim, 1993; Scheuneman, Gerritz & 
Embretson, 1991; Sheehan & Mislevy, 1990; and Tatsuoka, 1987). This paper introduces 
tree-based models as an exploratory technique for determining the structure of the regression 
equation and for developing graphical summaries of the modeling results. 

The Tree-Based Approach 

For problems involving a single numeric response (y) and a set of predictor variables 
(x) a binary regression tree is fit by successively splitting the data on the basis of the 
independent variables into binary subsets with similar values of the response variable. At 
each stage of model fitting, the splitting algorithm considers all possible splits of all possible 
predictor variables. When the potential predictor is a multi-level categorical variable, as was 
the case for several of the variables considered in this study, the splitting algorithm considers 
all collapsing strategies resulting in exactly two levels. When the potential predictor is a 
numeric variable, such as the item difficulty rating considered in this study, the splitting 
algorithm considers all possible cut points for grouping the observations into low and high 
subsets. Potential splits are evaluated in terms of deviance, a statistical measure of the 
dissimilarity in the response variable among the observations belonging to a single subset. At 
each stage of splitting, the original subset of observations is referred to as the parent node and 
the two outcome subsets are referred to as the left and right child nodes. The best split is the 
one that produces the largest decrease between the deviance of the parent node and the sum 
of the deviances in the two child nodes. The deviance of the parent node is calculated as the 
sum of the deviances of all of its members, 

$$D(y, \bar{y}) = \sum_{i} (y_i - \bar{y})^2$$

where $\bar{y}$ is the mean value of the response calculated from all of the observations in the node. 
The deviance of a potential split is calculated as 

$$D(y, \bar{y}_L, \bar{y}_R) = \sum_{i \in L} (y_i - \bar{y}_L)^2 + \sum_{i \in R} (y_i - \bar{y}_R)^2$$

where $\bar{y}_L$ is the mean value of the response in the left child node and $\bar{y}_R$ is the mean value of 
the response in the right child node. The split that maximizes the change in deviance 

$$\Delta D = D(y, \bar{y}) - D(y, \bar{y}_L, \bar{y}_R)$$

is the split chosen at a given node. In the final fitted model, the predicted value for each 
observation is the mean response calculated from only those observations belonging to the 
same terminal node. 
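
The split search described above can be sketched in a few lines of Python. This is a minimal illustration for a single numeric predictor, assuming the squared-error form of the deviance; it is not the software used in the study, and the function names (`deviance`, `best_numeric_split`) and the toy data are hypothetical.

```python
def deviance(ys):
    """Sum of squared deviations of the responses from their node mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_numeric_split(xs, ys):
    """Return (cut, drop) for the cut point with the largest deviance decrease.

    Candidate cuts lie midway between successive distinct ordered x values.
    """
    parent = deviance(ys)
    pairs = sorted(zip(xs, ys))
    best_cut, best_drop = None, 0.0
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:  # tied x values do not define a new cut point
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        drop = parent - (deviance(left) + deviance(right))
        if drop > best_drop:
            best_cut, best_drop = (lo + hi) / 2, drop
    return best_cut, best_drop


# Toy data (hypothetical): the response jumps once x exceeds 10.
xs = [2, 5, 8, 10, 12, 15]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2]
cut, drop = best_numeric_split(xs, ys)
```

Applying the same search recursively to each child node, and to every available predictor, would give the full tree-growing procedure.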

Figure 1 provides a graphical representation of a tree model estimated for a 
hypothetical set of 20 observations. In this particular representation, the number of 
observations in each node is plotted as the node label and the variables used to define each 
split are indicated on the lines connecting parents to children. Node locations indicate the 
predicted value of the response variable (read from the horizontal axis) and the estimated 
deviance value (read from the vertical axis). 



Insert Figure 1 Here 



As can be seen, the model has two splits yielding a total of three terminal nodes. The 
first split divides the data into subsets based on values of the categorical variable x1. 
Observations with values of x1 equal to A or B (denoted x1=AB) are classified into the left 
child node. Observations with values of x1 equal to C or D (denoted x1=CD) are classified 
into the right child node. The horizontal distance between the left child node and the right 
child node is the amount by which the predicted response for observations of type A or B 
differs from the predicted response for observations of type C or D. The second split divides 
the set of ten observations in the x1=AB node into subsets based on values of the second 
independent variable x2. The six observations with x1=AB and x2<=10 are classified into one 
subset; the four observations with x1=AB and x2>10 are classified into a second subset. There 
are no further splits of the x1=CD node, indicating that x2 was only helpful at predicting the 
value of y for observations with x1=AB. This type of interaction is common in problems 
involving several independent predictors. The final fitted model is specified in terms of the 
following three prediction rules (corresponding to the three terminal nodes in Figure 1): 

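$$\text{if } x_1 = AB \text{ and } x_2 \le 10 \text{ then } \hat{y} = \frac{1}{6} \sum_{i=1}^{6} y_i$$

$$\text{if } x_1 = AB \text{ and } x_2 > 10 \text{ then } \hat{y} = \frac{1}{4} \sum_{i=7}^{10} y_i$$

$$\text{if } x_1 = CD \text{ then } \hat{y} = \frac{1}{10} \sum_{i=11}^{20} y_i$$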



The various splits shown in Figure 1 represent the optimal sequence of splits 
determined from a consideration of all possible remaining splits at each stage of fitting. 
Splits of binary variables require a single evaluation. Splits of multi-level categorical 
variables require $2^{k-1} - 1$ evaluations, where k is the number of levels. For example, a 
categorical variable with 3 levels (A, B, and C) would be evaluated at three possible binary 
cuts (A vs. BC, AB vs. C, and B vs. AC). Splits of numeric variables must be evaluated 
between each successive pair of ordered observations, a total of n-1 evaluations, where n is 
the number of observations in the node (excluding ties). 
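
The counting rules above can be checked by brute-force enumeration. The following Python sketch lists every way of collapsing a categorical variable into exactly two groups and counts the cut points of a numeric variable; the function names are hypothetical and the code is only an illustration of the counting argument.

```python
from itertools import combinations


def categorical_splits(levels):
    """List every way to collapse the levels into exactly two non-empty groups."""
    levels = sorted(levels)
    anchor, rest = levels[0], levels[1:]
    splits = []
    # Keep the first level in the left group so mirror images are not counted twice.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (anchor,) + extra
            right = tuple(level for level in rest if level not in extra)
            splits.append((left, right))
    return splits


def numeric_split_count(values):
    """Number of cut points between successive distinct ordered values."""
    return len(set(values)) - 1


three = categorical_splits(["A", "B", "C"])       # A|BC, AB|C, AC|B
four = categorical_splits(["A", "B", "C", "D"])   # seven candidate splits
```

For a numeric variable observed at 20 distinct values, `numeric_split_count` gives the 19 cut points implied by the n-1 rule.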

The thoroughness of this approach to model selection can be appreciated by noting 
that, even for the simple example presented above, which included only two variables and 20 
observations, as many as 46 separate evaluations were required to determine the optimal 
model structure. The determination of the best initial split required 26 evaluations: 7 for the 
categorical variable x1, and 19 for the numeric variable x2. The determination of the best 
subsequent split of the x1=AB node required 10 additional evaluations: one to determine 
whether or not to continue splitting based on x1 (potentially yielding an x1=A node and an 
x1=B node) and nine to evaluate potential splits based on x2. And finally, although the final 
model did not include any subsequent splits of the x1=CD node, the decision to leave this 
node intact required 10 additional evaluations: one to evaluate a subsequent split based on x1 
and nine to evaluate a subsequent split based on x2. 



An Example: The Praxis I Mathematics Item Pool 

The Praxis I: CBT measures the mathematics, reading and writing skills of 
prospective teachers during their college years. Our example concerns a pool of 510 
mathematics items which were pretested in the Fall and Winter of 1992. The field test was 
structured so that examinees were administered overlapping subsets of items. This was 
accomplished by dividing the original item pool into representative 17-item blocks and 
administering three blocks of items to each examinee. Under this design, each examinee 
received 51 items, and each item was administered to approximately 900 examinees. The 
entire pool of 510 items was then calibrated using a 3PL model, fit by means of Mislevy and 
Bock's (1983) BILOG program. A representative subset of 114 items was subsequently 
selected for use in the analysis of items' operating characteristics. This subset included 48 
free-response items and 66 multiple-choice items. 
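
For reference, the 3PL item response function that defines the three operating characteristics can be written as a short function. This is a sketch under the usual textbook parameterization; the scaling constant D = 1.7 and the function name `p_correct` are assumptions made for illustration and are not taken from the calibration run described above.

```python
import math


def p_correct(theta, a, b, c, D=1.7):
    """3PL probability of a correct response at ability theta.

    a: discrimination, b: difficulty, c: lower asymptote.
    D is a scaling constant (1.7 is a common convention; an assumption here).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

At theta equal to b the function returns c + (1 - c)/2, and for very low abilities it approaches the asymptote c, which is why c is fixed at zero for free-response items later in the paper.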

Item Attributes 

Item attributes were developed by asking members of the ETS Test Development staff 
to list surface features of the items and aspects of the solution process which would be 
expected to contribute to item difficulty. The resulting attribute list included 13 item feature 
variables and 13 solution process variables. Two members of the staff whose duties included 
the writing of similar types of items were then asked to rate each of the items on each of the 
item feature variables and each of the solution process variables. Raters were also asked to 
provide overall ratings of item difficulty expressed on a 1 to 5 scale. Except where noted, 
subsequent analyses are based on the average of the two sets of ratings obtained. 

Information about item content was also available. In particular, each item was 
classified as belonging to one of five content areas: 

A. Number Sense and Operations 

B. Mathematical Relationships 

C. Data Interpretation 

D. Geometry and Measurement 

E. Reasoning 

The item feature variables and the solution process variables are listed in Tables 1 and 
2 along with frequency statistics, rater agreement statistics and correlations with item 
parameters. (Attribute abbreviations are given in parentheses.) Rater agreement was fairly 
high, greater than 90% for all but one of the surface feature variables and averaging about 
82% for the solution process variables. Correlations are reported for all of the items 
combined (n=114) and for subsets defined by content area and response format. These 
subsets were suggested by the tree analyses reported below. 

The global judgements of item difficulty provided by the two raters are summarized in 
Table 3. For 92% of the items, the difference between the ratings provided by the two raters 
was less than or equal to one point (on a five point scale). Table 3 also provides correlations 
with item parameters calculated for all of the items combined and for items grouped by 
response format. For the set of all items combined, item difficulty was more highly 
correlated with the average difficulty rating than with either of the two individual ratings (.47 
vs. .40 or .43). The individual correlations show that the two raters were differentially adept 
at rating items with different response formats. In particular, Rater 1 was more adept at 
rating the free-response items and Rater 2 was more adept at rating the multiple-choice items. 

Insert Tables 1-3 About Here 



Analysis of Item Difficulty 

Our investigation into the components of item difficulty was conducted using a 
combination of tree-based modeling and regression analysis. Tree-based modeling can be 
considered as an exploratory technique for uncovering structure in data (Clark & Pregibon, 
1992). In this study, tree models are used to identify important interaction effects, to select 
subsets of variables for consideration in subsequent regression analyses, and to provide 
graphic displays of the modeling results. 

The tree-based analysis of item difficulty was conducted in stages. The set of 
predictor variables considered in the initial stage of the analysis consisted of all of the item 
attributes described above except for the item difficulty ratings provided by the two raters. 
The difficulty rating data was intentionally excluded to avoid swamping the information 
available from the other item attributes. This strategy allowed several interesting features of 
the data to be revealed. 

Most of the attributes considered in this study were originally scored on a binary 
scale. Consequently, the average attribute scores considered in the tree-based analyses were 
specified on a three-point scale: 1 = both raters agreed that the feature was present; 0 = both 
raters agreed that the feature was not present; and 0.5 = the two raters disagreed on whether 
or not the feature was present. Potential splits of these variables were evaluated twice: once 
with disagreements grouped with 1s (feature present); and once with disagreements grouped 
with 0s (feature not present). As will be seen, the optimal grouping varied from one attribute 
to another and from one analysis to another. 
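
The three-point coding and the two candidate groupings can be illustrated with a short Python sketch. The function names and the example ratings are hypothetical; only the coding scheme itself is taken from the description above.

```python
def average_score(rater1, rater2):
    """Average two binary ratings: 1 = both present, 0 = both absent, 0.5 = disagreement."""
    return (rater1 + rater2) / 2


def candidate_groupings(scores):
    """Return the two binary codings compared for each attribute.

    'with_present' groups disagreements (0.5) with the 1s;
    'with_absent' groups disagreements with the 0s.
    """
    return {
        "with_present": [1 if s > 0 else 0 for s in scores],
        "with_absent": [1 if s == 1 else 0 for s in scores],
    }


# Hypothetical ratings of one attribute on four items by two raters.
scores = [average_score(r1, r2) for r1, r2 in [(1, 1), (1, 0), (0, 0), (0, 1)]]
groupings = candidate_groupings(scores)
```

Each grouping would then be evaluated as an ordinary binary split, and the one giving the larger decrease in deviance retained.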

Figure 2 provides a graphical representation of a tree model developed to predict item 
difficulty from the surface feature variables and the solution process variables. The predicted 
difficulty value associated with each node can be read from the horizontal axis. The item 
attributes used to define each split are indicated on the lines connecting parents to children. 
Split definitions also indicate the optimal treatment of rater disagreement. Split definitions of 
the form "attribute<1" and "attribute=1" indicate that, for that attribute, the disagreement 
items were grouped with the items coded as "feature not present". Split definitions of the 
form "attribute=0" and "attribute>0" indicate that, for that attribute, the disagreement items 
were grouped with the items coded as "feature present". In each case, the grouping selected 
was the one which provided the best prediction. 
Insert Figure 2 Here 



As can be seen, the first split divides the items into subsets based on values of the 
content area variable: the 61 items classified as content area A (Number Sense and 
Operations) or C (Data Interpretation) are assigned to the left child node; the 53 items 
classified as content area B (Mathematical Relationships), D (Geometry and Measurement), or 
E (Reasoning) are assigned to the right child node. The AC node is subsequently split into 
the 55 items rated as routine applications (NONROU<1) and the 6 items rated as nonroutine 
applications (NONROU=1), indicating that, although AC items are generally easier than BDE 
items, those AC items rated as nonroutine applications are among the most difficult items in 
the pool. Although the routine/nonroutine variable is highly predictive of the difficulty of AC 
items, the fact that it does not appear among the variables selected to define further splits of 
the BDE node indicates that it provides minimal information about gradations of difficulty 
among BDE items. In fact, Figure 2 shows that there is no overlap whatsoever 
between the subset of variables selected for the prediction model for AC items and the subset 
of variables selected for the prediction model for BDE items. Confirmation of this 
unexpected result can be found in Tables 1 and 2, which provide correlation coefficients 
calculated separately for the AC items and the BDE items. In all but one case, variables 
which are significantly correlated with the difficulty of AC items are not significantly 
correlated with the difficulty of BDE items and, conversely, variables which are significantly 
correlated with the difficulty of BDE items are not significantly correlated with the difficulty 
of AC items. In addition, the magnitudes of the correlations calculated from the appropriate 
subset (either AC items or BDE items) are greater than those calculated from the combined 
set of 114 items. The one exception noted concerns the solution process variable "Apply 
standard algorithm in a nonstandard manner". This variable is significantly correlated with 
both types of items but only appears in the tree model estimated for the BDE items. This 
discrepancy can be explained by the correlation of this variable with several of the surface 
feature variables. 

A regression analysis was conducted to evaluate the predictive capability of the 
solution process variables (SPVs) and the item feature variables (IFVs). Thirty variables were 
considered in the analysis: (1) all of the SPVs and IFVs with average frequencies of at least 
five (11 SPVs and 9 IFVs); (2) a dummy variable used to distinguish AC items from BDE 
items; and (3) a set of 9 interaction terms. The interaction terms were defined by crossing 
Type=AC with five other variables (Word Problem, Order & Match, Histogram, Nonroutine 
Application, and Recall or Recognize Facts) and by crossing Type=BDE with four other 
variables (Quantitative Comparison, Apply Standard Algorithm, Apply Standard Algorithm in 
Nonstandard Manner, and Apply Multistep Thinking). The results are presented as Model #1 
in Table 4. Of the 30 variables originally considered, 8 were significant at an alpha level of 
0.15, including four of the interaction terms suggested by the tree-based analysis. The 
estimated eight-variable model accounted for 28% of the variance in item difficulty. 
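
The 28% figure corresponds to the adjusted R^2 listed for Model #1 in Table 4. Assuming the usual adjustment for the number of predictors (the formula is not stated in the paper, so it is an assumption here), the adjusted values can be checked from the unadjusted R^2 values and the degrees of freedom reported in that table. The function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n_items, n_predictors):
    """Standard adjusted R^2: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_items - 1) / (n_items - n_predictors - 1)


# Unadjusted R^2 and number of predictors for Models 1-4 in Table 4 (n = 114 items).
table4 = {1: (0.33, 8), 2: (0.22, 1), 3: (0.30, 2), 4: (0.39, 6)}
adjusted = {m: round(adjusted_r2(r2, 114, p), 2) for m, (r2, p) in table4.items()}
```

Under this assumption the computed values agree with the adjusted R^2 row of Table 4 to two decimal places.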






Insert Table 4 Here 



The analyses described above did not consider the information about item difficulty 
available from the global judgements of item difficulty provided by the two raters. Figure 3 
presents the tree model obtained by adding the average difficulty rating (DR) to the set of 
variables considered previously. As can be seen, the average difficulty rating is now the most 
important predictor, accounting for the first several splits. Note that the average difficulty 
rating has divided the items into three distinct groups: the low group consists of items with 
average ratings between 1 and 2.5 inclusive, the medium group consists of items with average 
ratings between 3 and 4 inclusive, and the high group consists of items with average ratings 
between 4.5 and 5 inclusive. This grouping is highly correlated with item difficulty: low 
rated items tend to have estimated difficulties below -1.0; medium rated items tend to have 
estimated difficulties between -1.0 and 0.0; and high rated items tend to have estimated 
difficulties greater than 0.0. The most notable exception to this rule occurred for items 
involving a Quantitative Comparison (QC). Figure 3 shows that the difficulty of the QC 
items was consistently underrated. In particular, QC items with estimated difficulties in the 
medium range were given low ratings and QC items with estimated difficulties in the high 
range were given medium ratings. 



Insert Figure 3 Here 



Additional evidence of the raters' tendency to underrate the difficulty of QC items is 
provided in Figure 4, which depicts the least squares regression line estimated from the entire 
set of 114 items, along with points representing individual items. (QC items are plotted as 
circles, non-QC items are plotted as dots.) As can be seen, almost all of the QC items are 
underpredicted. The amount of variation in item difficulty accounted for by the regression on 
average difficulty rating was 21% (Model #2 in Table 4). When the regression was rerun 
with the QC variable included, the amount of variation accounted for increased to 29% 
(Model #3 in Table 4). 
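
To make the size of the QC effect concrete, the coefficients reported for Model #3 in Table 4 can be applied directly. The sketch below assumes that QC is coded 0/1 and that the rating is the two-rater average on the 1 to 5 scale; the function name `predicted_difficulty` is hypothetical.

```python
# Coefficients for Model #3 as reported in Table 4.
INTERCEPT = -2.489
B_RATING = 0.542
B_QC = 0.709


def predicted_difficulty(avg_rating, is_qc):
    """Predicted 3PL difficulty from the average rating and a QC indicator."""
    return INTERCEPT + B_RATING * avg_rating + B_QC * (1 if is_qc else 0)


non_qc = predicted_difficulty(3.0, False)  # about -0.86
qc = predicted_difficulty(3.0, True)       # about -0.15
```

At the same average rating, a QC item is predicted to be about 0.7 units more difficult than a non-QC item, which is the underrating pattern visible in Figures 3 and 4.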



Insert Figure 4 Here 



An additional analysis was conducted to determine whether any of the other item 
attributes provided improved prediction over and above that provided by the average difficulty 
rating and the QC variable. Using a stepwise procedure, four additional variables were 
selected. The additional variables included two SPVs (Apply Standard Algorithm and 
Translate Words to Symbols); one item feature variable (Histogram); and one interaction term 
(type=AC crossed with Order and Match). Estimated coefficients are given as Model #4 in 
Table 4. The enhanced model accounted for 36% of the variation in item difficulty. 

For practical applications requiring maximum predictive power, the enhanced model 
(Model #4 in Table 4) is preferable, since it explains the greatest amount of variation in item 
difficulty. The other models provide useful information about what makes items easy or hard. 
In particular, residuals from the analytical model (Model #1 in Table 4) can be consulted for 
clues as to why some items are unexpectedly easy or hard, given the identifiable factors that 
are usually associated with item difficulty. 



Analysis of Item Discrimination 

The tree-based analysis of item discrimination considered all of the item attributes 
described previously. The fitted model is plotted in Figure 5. The first split shows that items 
containing equations (EQUA>0) tend to be more discriminating than those without, although 
this is not always the case. The most prominent exception occurs for multiple-choice items 
(MC=1) formulated as word problems (WORDP>0) which cannot be solved through 
application of a standard algorithm (STDALG<1). The 15 items with this combination of 
attributes were among the most highly discriminating in the pool. The plot also shows that 
the least discriminating items were those which did not involve equations (EQUA=0) and 
could be solved through application of a standard algorithm (STDALG=1). 



Insert Figure 5 Here 



A linear prediction model for item discrimination was estimated using a stepwise 
regression procedure. The variables considered in the analysis included all of the item 
attributes listed in Tables 1 and 2 with average frequencies of at least five, plus three 
interaction terms suggested by the tree model. The interactions were defined as follows: 

(1) MC*WORDP=1 if (MC=1 & WORDP>0), MC*WORDP=0 otherwise; 

(2) NE*WORDP=1 if (EQUA=0 & WORDP>0), NE*WORDP=0 otherwise; 

(3) NE*STDALG=1 if (EQUA=0 & STDALG=1), NE*STDALG=0 otherwise. 

The estimated regression model included one of the item feature variables (EQUA), and two 
of the interaction terms (MC*WORDP and NE*STDALG). As shown in Table 5, EQUA and 
MC*WORDP have positive coefficients and NE*STDALG has a negative coefficient. 
Together, these variables account for 12% of the variance in item discrimination. 
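
The three interaction terms, and the fitted equation they enter, can be written out as a short Python sketch. The coefficients are those listed for Model 2 in Table 5; the function names are hypothetical, and the inputs are assumed to be the averaged attribute scores (0, 0.5 or 1) described earlier.

```python
def interaction_terms(mc, equa, wordp, stdalg):
    """Build the three 0/1 interaction terms from averaged attribute scores."""
    return {
        "MC*WORDP": 1 if (mc == 1 and wordp > 0) else 0,
        "NE*WORDP": 1 if (equa == 0 and wordp > 0) else 0,
        "NE*STDALG": 1 if (equa == 0 and stdalg == 1) else 0,
    }


def predicted_discrimination(mc, equa, wordp, stdalg):
    """Apply the Model 2 coefficients reported in Table 5."""
    terms = interaction_terms(mc, equa, wordp, stdalg)
    return (0.930 + 0.133 * equa
            + 0.159 * terms["MC*WORDP"]
            - 0.096 * terms["NE*STDALG"])
```

For example, an item with no equation that can be solved with a standard algorithm receives the lowest prediction (0.930 - 0.096), consistent with the least discriminating group identified in Figure 5.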



Insert Table 5 Here 



Analysis of Item Asymptotes 

The 3PL asymptote parameter measures the likelihood of responding correctly to an 
item through random guessing. Since the chances of guessing the correct response to a free- 
response item are extremely small, we followed the common practice of setting the asymptote 
parameter equal to zero for all of the free-response items in this study. Consequently, our 
analysis of item asymptotes was confined to the 66 items classified as multiple-choice 
(MC=1). This subset included 41 standard multiple-choice items with 3 or 4 options, and 25 
nonstandard multiple-choice items with varying numbers of options, from eight to more than 
twenty. (The exact number of options was not tallied for items with more than twenty 
options.) The tree model estimated from this data is plotted in Figure 6. As can be seen, the 
number of choices is the single most important predictor. Items with five or more choices 
have low predicted asymptote values (c<0.15); items with fewer than five choices have high 
predicted asymptote values (c>0.15). 



Insert Figure 6 Here 



The linear regression models estimated to predict item asymptotes are listed in Table 
6. A model including the single variable, "Number of Choices", accounts for 59% of the 
variance. A model including "Number of Choices" and four additional variables accounts for 
85% of the variance. The additional variables include: a dummy variable coded as 1 for 
items with twenty or more options, and zero otherwise; the square of the number of choices 
variable; and two solution process variables "Apply standard algorithm in a nonstandard 
manner" and "Interpret mathematical vocabulary". The dummy variable was included to 
account for the fact that the No. of Choices variable was truncated at twenty. 
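
The three choice-count predictors can be derived from the raw option count as in the Python sketch below. It assumes that counts above twenty were recorded as twenty, which is one reading of "truncated at twenty"; the function name `asymptote_features` is hypothetical.

```python
def asymptote_features(n_options, cap=20):
    """Return (number of choices, >=cap dummy, squared number of choices).

    Assumes counts above the cap were recorded as the cap itself.
    """
    n = min(n_options, cap)
    dummy = 1 if n_options >= cap else 0
    return n, dummy, n ** 2
```

Under this coding a 4-option item yields (4, 0, 16), while any item with twenty or more options yields (20, 1, 400), so the dummy variable absorbs whatever the truncated count cannot distinguish.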



Insert Table 6 Here 



Past efforts to develop prediction models for item asymptotes were significantly less 
successful than the current effort. In the analysis of verbal items reported in Mislevy et al. 
(1993), for example, the prediction model for item asymptotes only accounted for 5% of the 
variance. The success of the current effort can be attributed to the many different types of 
items included in the Praxis I pool. Whereas most previous analyses have considered 
similarly formatted items (e.g. all four-choice multiple-choice items), the Praxis pool includes 
items with many different formats, from standard 3- or 4-choice multiple-choice items, to 
items requiring the examinee to select a response from a table of more than 20 numbers. 
This variation in item format resulted in the large variation in the Number of Choices variable 
which accounted for the high value of R^2 obtained. 
Evaluation of Model Fit 



Predicted values of discrimination parameters, difficulty parameters and asymptotes are 
plotted vs. "true" values in Figure 7. Predicted values were obtained by applying the 
prediction equations with the highest values of R^2 as reported in Tables 4, 5 and 6. "True" 
values are the parameter estimates obtained in the original calibration of the entire pool of 
510 items. Free-response items are plotted with a circle; multiple-choice items are plotted 
with a dot. Although considerable variation remains for discrimination parameters, much of 
the variation in difficulty parameters and asymptotes has been accounted for. In addition, the 
plots show no unusual outliers. 



Insert Figure 7 Here 



Analysis of Difficulty Rating Data 

The test developers' global ratings of item difficulty were the single most important 
predictor of item difficulty among all those considered in this study. Because this variable 
turned out to be so important, an additional analysis was conducted to determine what could 
be learned about the "mental model" raters used to judge item difficulty. Figure 8 presents a 
tree model developed to predict the difficulty rating score from the other item attributes. 
Unlike the tree models presented previously, this model was built from the raw (unaveraged) 
rating data provided by the two raters. The item attributes considered in the analysis included 
all of the item attributes listed in Tables 1 and 2 with observed frequencies of at least five, a 
variable indicating whether the item was classified as free-response or multiple-choice, a 
variable indicating the source of the observation (Rater 1 or Rater 2), and a variable 
indicating the content area covered by the item (A, B, C, D or E). As shown in Figure 8, 
neither the rater identification variable nor the content area variable was selected for the 
tree-based prediction model. The tree also shows that low rated items and high rated items are 
easily identified: low rated items are those that do not involve multi-step thinking 
(MTHINK=0) and can be solved by recalling or recognizing facts (RECALL=1). High rated 
items are those that do involve multi-step thinking (MTHINK=1). For items in the middle of 
these two extremes, the picture is more complicated, involving several other item attributes. 
This information may prove useful for future studies designed to refine the attribute scoring 
procedures. 



Insert Figure 8 Here 
Discussion 



The tree-based approach described in this paper enabled us to develop a set of linear 
models for predicting the difficulty, discrimination and asymptote parameters of the Praxis I 
mathematics items. Using easily obtainable information about item features and test 
developers' ratings of item difficulty, we were able to explain 36% of the variation in item 
difficulty parameters, 12% of the variation in item discrimination parameters and 85% of the 
variation in item asymptote parameters. This is enough predictive power to be practically 
useful since, as was shown in Mislevy et al. (1993), similar models explaining even less 
variation, when used as prior distributions for item parameters, provided the information 
equivalent of approximately 250 additional pretest calibration subjects. 

The tree-based approach employed in this study contributed to the success of the 
modeling effort in two ways: (1) it helped us to identify several important interaction effects 
which might not otherwise have been identified; and (2) it provided graphical displays of the 
modeling results which helped us to understand and discuss the models. We expect that the 
feedback provided by the tree-based displays will also prove useful in future efforts to refine 
the attribute scoring procedures. 

Due to the limited number of items available, the models developed in this study 
could not be cross-validated. Additional research is needed to validate the model structure 
and to investigate the stability of the estimated parameters. 
Table 4 

Summary of Item Difficulty Modeling Results: 
Estimated Regression Coefficients and R^2 Values 

                                        Alternative Models 
Parameter (a)                        1         2         3         4 
Intercept                        -.158    -2.146    -2.489    -1.899 
Difficulty Rating                           .482      .542      .497 
Quantitative Comparison           .403                .709      .559 
Apply Std. Algorithm             -.545                         -.437 
Histogram                        -.971                         -.844 
Order & Match                    1.185 
Translate Words to Symbols                                     -.405 
BDE*(Nonstd. Application)         .477 
BDE*(Apply Mul. Thinking)         .525 
AC*(Order & Match)              -1.668                         -.601 
AC*(Recall/Recog. Only)          -.685 
df                             (8,105)   (1,112)   (2,111)   (6,107) 
R^2                                .33       .22       .30       .39 
Adjusted R^2                       .28       .21       .29       .36 

a) All regression coefficients were significant at an alpha level of .15. 

The adjusted R^2 is corrected for the number of variables in the model. 

AC content areas = Number Sense & Operations and Data Interpretation. 

BDE content areas = Mathematical Relationships, Geometry & Measurement, and Reasoning. 




Table 5 



Summary of Item Discrimination Modeling Results: 
Estimated Regression Coefficients and R^2 Values 

                              Alternative Models 
Parameter (a)                        1         2 
Intercept                         .928      .930 
Equation                          .146      .133 
MC*(Word Problem)                           .159 
NE*(Apply Standard Alg.)                   -.096 
df                             (1,112)   (3,110) 
R^2                                .04       .14 
Adjusted R^2                       .03       .12 

a) All regression coefficients were significant at an alpha level of .15. 
The adjusted R^2 is corrected for the number of variables in the model. 
MC = Multiple Choice Item Format. NE = The item does not contain 
an equation or formula. 
Table 6 



Summary of Item Asymptote Modeling Results: 
Estimated Regression Coefficients and R^2 Values 

                                      Alternative Models 
Parameter (a)                                1         2 
Intercept                                 .257      .553 
No. of Choices                           -.014     -.108 
Choices>=20?                                       -.679 
(No. of Choices)^2                                  .006 
Apply Std. Alg. in Nonstd. Manner                  -.063 
Interpret Math. Vocabulary                          .035 
df                                      (1,64)    (5,60) 
R^2                                        .60       .87 
Adjusted R^2                               .59       .85 

a) All regression coefficients were significant at an alpha level of .15. 
The adjusted R^2 is corrected for the number of variables in the model. 
Figure Captions 

Figure 1. A sample tree model for 20 observations. 

Figure 2. Prediction of item difficulty from solution process variables and item features. 

Figure 3. Prediction of item difficulty from solution process variables, item features and 
difficulty rating. 

Figure 4. Relationship of item difficulty to average difficulty rating. 

Figure 5. Prediction of item discrimination from solution process variables & item features. 

Figure 6. Prediction of item asymptote from solution process variables & item features. 

Figure 7. Evaluation of model fit. 

Figure 8. Prediction of difficulty rating from solution process variables and item features. 